
Support for Chinese, Japanese, Korean #13

Open

patrick-wilken wants to merge 9 commits into main from feature/cjk_support

Conversation

patrick-wilken (Collaborator) commented Sep 3, 2025

Because it performs no word tokenization, SubER so far does not produce meaningful scores when computed directly on "scriptio continua" languages like Chinese and Japanese. The same is true of many of the other available metrics.

Zhou and Yoshinaga noticed and rightfully criticized this in a recent paper.

It's already possible to tokenize the subtitle files as a preprocessing step and then call SubER. This is apparently also what was done in the paper, using "SacreBLEU's TER tokenizer" (TercomTokenizer with "asian_support"?). But this is inconvenient, and, more importantly, tokenization should be standardized by making it part of the evaluation tool, in particular because different metrics require different tokenization (e.g. BLEU vs. TER), including no tokenization at all (chrF and CER).
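For reference, the preprocessing workaround looks roughly like this (a sketch, not what the paper necessarily did; file names are placeholders, and the import path is sacrebleu 2.x):

```python
from sacrebleu.tokenizers.tokenizer_ter import TercomTokenizer

tokenizer = TercomTokenizer(asian_support=True)

# Tokenize a plain-text subtitle file line by line before passing it to SubER.
with open("hyp.txt", encoding="utf-8") as f_in, \
        open("hyp.tok.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        f_out.write(tokenizer(line.rstrip("\n")) + "\n")
```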

I have implemented word tokenization here for all the metrics where it is necessary, but some of my choices still need to be checked. For the sacrebleu metrics (BLEU, TER, chrF) the situation is fairly clear: we want to pass the correct language information to sacrebleu but otherwise use its implementation unchanged, even if it potentially does not handle everything correctly (a sketch of this setup follows below).
For WER, jiwer does not provide built-in tokenization, so we have a free choice. For now I'm using the TercomTokenizer with "asian_support", because SubER is based on the TER implementation, and WER and TER are closely related, so this is the most consistent option. However, I have my doubts that the TercomTokenizer does a good job for Japanese. MeCab, as used for BLEU, seems to be a much more common and better choice? I will need to talk to native speakers and to colleagues experienced in evaluating Japanese ASR to confirm.
The same applies to SubER itself. Most consistent would be to use the TercomTokenizer, so that TER and SubER scores stay closely related for these languages too. But MeCab might be better for Japanese. For Chinese I'm already quite confident it works well after these changes: I get very similar scores to applying character splitting to the input subtitle files.
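In code, the intended setup is roughly the following (a sketch, not the actual diff; `trg_lang="ja"` makes sacrebleu pick its "ja-mecab" tokenizer, which requires sacrebleu's MeCab extra to be installed):

```python
import jiwer
from sacrebleu.metrics import BLEU, CHRF, TER
from sacrebleu.tokenizers.tokenizer_ter import TercomTokenizer

hyps = ["これはペンです"]    # placeholder data
refs = [["これはペンです"]]

bleu = BLEU(trg_lang="ja")     # selects the MeCab tokenizer internally
ter = TER(asian_support=True)  # character-level handling of CJK ideographs
chrf = CHRF()                  # character-based, no word tokenization needed

print(bleu.corpus_score(hyps, refs).score)
print(ter.corpus_score(hyps, refs).score)
print(chrf.corpus_score(hyps, refs).score)

# WER: jiwer has no built-in tokenizer, so tokenize both sides up front
# and let jiwer compute the edit distance on space-separated words.
tokenize = TercomTokenizer(asian_support=True)
print(jiwer.wer(tokenize(refs[0][0]), tokenize(hyps[0])))
```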

For Korean I'm generally not sure how much of a tokenizer is needed. All the subtitle files I have seen use spaces, and the "asian_support" option of the TercomTokenizer does nothing to Korean text, as far as I can see (see the sketch below). But we should definitely pass "ko" as the language code to sacrebleu so that BLEU makes use of the MeCab-ko tokenizer. I'm assuming it is supposed to improve handling of the rich morphology? Then it might make sense to use it for SubER as well. I will need to talk to people with language expertise here as well.
(I might already be embarrassing myself with the Chinese/Japanese unit tests I wrote. 😄)
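A quick way to check the Korean behaviour (a sketch; the module path and class name are my reading of sacrebleu 2.x, and "ko-mecab" needs the mecab-ko extra installed):

```python
from sacrebleu.tokenizers.tokenizer_ter import TercomTokenizer
from sacrebleu.tokenizers.tokenizer_ko_mecab import TokenizerKoMecab

sentence = "안녕하세요 여러분"  # placeholder Korean text

print(TercomTokenizer(asian_support=True)(sentence))  # expected: unchanged
print(TokenizerKoMecab()(sentence))                   # expected: morpheme-split
```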

Hypothesis-to-reference alignment for the AS- and t- metrics is still missing. It should be done using the same tokenizer as for SubER, but the tokenization needs to be reversed before computing the metrics (which may or may not tokenize again themselves).
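The "reversing" step is the non-obvious part for CJK: spaces between two CJK characters are artifacts of the tokenization and must be removed, while spaces next to Latin words are real and must be kept. A minimal sketch (not the actual implementation; the character ranges cover only the common kana and CJK ideograph blocks):

```python
import re

# Hiragana/Katakana, CJK Extension A, CJK Unified, CJK Compatibility.
_CJK = "[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]"

def detokenize_cjk(text: str) -> str:
    # Remove a space only when it sits between two CJK characters.
    return re.sub(f"({_CJK}) (?={_CJK})", r"\1", text)

print(detokenize_cjk("これ は ペン です"))  # -> これはペンです
```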

This is a breaking change, but should be considered a bug fix.
Still to be checked: maybe better tokenization than TercomTokenizer is needed for SubER, especially for Japanese?
Hypothesis-to-reference alignment for the AS- and t- metrics is still missing.
patrick-wilken (Collaborator, Author) commented

Hypothesis-to-reference alignment is now implemented, so AS-BLEU, t-BLEU etc. can be calculated correctly for Japanese, Chinese and Korean.

Also, for SubER and WER I switched to the tokenizers sacrebleu uses for these languages, namely MeCab ("ja-mecab") and "zh", instead of the TercomTokenizer with "asian_support". The main reason is that the TercomTokenizer does not split sequences of Hiragana and Katakana characters at all.
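To illustrate the difference (a sketch; module paths as in sacrebleu 2.x, and the MeCab tokenizer needs sacrebleu's ja extra):

```python
from sacrebleu.tokenizers.tokenizer_ter import TercomTokenizer
from sacrebleu.tokenizers.tokenizer_ja_mecab import TokenizerJaMecab

line = "これはペンです"  # placeholder Japanese text, kana only

print(TercomTokenizer(asian_support=True)(line))  # kana run stays one token
print(TokenizerJaMecab()(line))                   # e.g. これ は ペン です
```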

I manually tested this on multiple Japanese and Chinese files, and I get the same SubER scores as when applying those tokenizers to hypothesis and reference before calling the SubER tool*. There are slight tokenization differences when applying MeCab to space-separated words/parts, as the SubER code does, compared to applying it to full subtitle lines. But the score difference is <0.2 SubER points, and neither method is more correct, I guess.

*(Note that to reproduce exact SubER scores - as opposed to SubER-cased - one also has to remove punctuation from the tokenized input files; otherwise those pure punctuation tokens will not be normalized away during the SubER calculation.)
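For reference, the punctuation filtering in that footnote amounts to something like this (a sketch with a hypothetical helper name, not part of the PR; uses only the standard library):

```python
import unicodedata

def drop_punctuation_tokens(line: str) -> str:
    """Remove tokens that consist of punctuation characters only."""
    def is_punct(token: str) -> bool:
        return all(unicodedata.category(ch).startswith("P") for ch in token)
    return " ".join(token for token in line.split() if not is_punct(token))

print(drop_punctuation_tokens("これ は ペン です 。"))  # drops the "。"
```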

I also tested that running the align_hyp_to_ref tool and then running sacrebleu on the aligned hypothesis gives the same scores as calling the SubER tool to compute AS-BLEU (`python3 -m suber -l {ja,zh} -m AS-BLEU [...]`).

So I'm quite confident that the implementation is correct, despite being somewhat complicated. I guess I'll run the same manual tests for Korean as well, and also add more unit tests for these languages. But it should be good to use already.

patrick-wilken marked this pull request as ready for review February 12, 2026 14:23
